A few years ago, I read my first nonfiction self-help book: “The Power of Habit.” I was hooked on how the book provided a frame to manage the randomness of our lives through something as simple as habits. Since then, I haven’t stopped reading self-help books.
The Power of Habit Book
Beyond the self, there was a section in the book that shifted my paradigm on the interaction between businesses and customers. The book highlighted the strategy Target uses to push ads towards their customers. Based on objective features like their gender, age, home address, past purchases, etc, they could identify the habits of their customers and predict when a customer was pregnant months before the customer would even know.
As creepy as that sounds, it’s actually incredibly interesting how an objective factor like data could reveal something about human life. You can read more about it here. Now I won’t be predicting which customers are expecting a baby, but using a similar approach we’re taking objective features to gain a better understanding of our customers.
Imagine we’ve been hired to do some analysis for the new Trader Joe’s of grocery stores: Stellar Harvest. While Trader Joe’s champions unique products and an infamous store layout, Stellar Harvest is making its way up through data-driven marketing strategies to optimize customer experience. Stellar Harvest is about to launch a new marketing campaign and wants to know which customers would accept the offer in the campaign. What do we do? Well as data scientists, we can use machine learning tools to take in data we know about our customers and make some well-informed predictions.
“Stellar Harvest” Logo
Before getting into the analysis, we need to load the proper packages. Each package is like a new function on a swiss army knife. In R, there are hundreds of different packages. We’ll be downloading the ones relevant for data analysis, data visualization, and creating machine learning models.
library(tidyr)
library(lubridate)
library(dplyr)
library(naniar)
library(readxl)
library(caret)
library(themis)
library(readr)
library(vip)
# Model building packages
library(tidymodels)
library(glmnet)
library(discrim)
#Visualization packages
library(ggplot2)
library(corrplot)
library(patchwork)
library(scales)
library(rpart.plot)
# Loading data
df <- read_excel("data/unprocessed/marketing_campaign.xlsx")
head(df)
This data was taken from the Kaggle data set “Marketing Campaign.” The dataset does not specify where the marketing campaign data comes from so we will set it in the context of a pseudo grocery story called Stellar Harvest.
Now that all the packages and data are loaded in, lets take a look at what exactly is in the data. Exploring data is essential to the process as it allows us to build some context on our data. Tidying the data is also essential because it allows us to ensure the data we’re about to analyze is in an ideal format for our analysis goals.
dim(df)
## [1] 2240 29
names(df)
## [1] "ID" "Year_Birth" "Education"
## [4] "Marital_Status" "Income" "Kidhome"
## [7] "Teenhome" "Dt_Customer" "Recency"
## [10] "MntWines" "MntFruits" "MntMeatProducts"
## [13] "MntFishProducts" "MntSweetProducts" "MntGoldProds"
## [16] "NumDealsPurchases" "NumWebPurchases" "NumCatalogPurchases"
## [19] "NumStorePurchases" "NumWebVisitsMonth" "AcceptedCmp3"
## [22] "AcceptedCmp4" "AcceptedCmp5" "AcceptedCmp1"
## [25] "AcceptedCmp2" "Complain" "Z_CostContact"
## [28] "Z_Revenue" "Response"
Our data has 2240 observations (aka customers) and 29 columns (aka
traits for each customer). Our response variable is
Response which we will rename to
AcceptedLastCmp for clarity. The response represents
whether or not a customer accepted the offer in the latest marketing
campaign.
Since Stellar Harvest pushes their campaigns online, it would be
interesting to see the relationship between age/generation and digital
literacy. To simplify the analysis, I’ve added a new column
Generation to display the generation of each customer.
Based on our oldest and youngest customer, our generations will include:
Lost Generation, Greatest Generation, Silent Generation, Baby Boomer
Generation, Generation X, and Millenials.
We’ll also be turning the TotalDays from a date to a
numeric to make it easier to work with.
# Find min and max of birth year column
print(paste("The oldest person was born in ", min(df$Year_Birth), " and the youngest person was born in ", max(df$Year_Birth)))
## [1] "The oldest person was born in 1893 and the youngest person was born in 1996"
# Create new column with generations
df$Generation <- ifelse(df$Year_Birth < 1901, "Lost Generation",
ifelse(df$Year_Birth < 1928, "Greatest Generation",
ifelse(df$Year_Birth < 1946, "Silent Generation",
ifelse(df$Year_Birth < 1965, "Baby Boomer Generation",
ifelse(df$Year_Birth < 1981, "Generation X",
ifelse(df$Year_Birth < 1997, "Millenials"))))))
Stellar Harvest is interested in knowing whether a customer will
accpet the latest marketing campaign so it would also be helpful to know
whether a customer has accepted an offer in the past. To help us dissect
this information, lets create a new variable called
AcceptedPrv which we can call on for our EDA.
df <- df %>%
mutate(AcceptedPrv = AcceptedCmp1 + AcceptedCmp2 + AcceptedCmp3 + AcceptedCmp4 + AcceptedCmp5)
Z_CostContact and Z_Revenue respectively
represent the cost to contact a customer and revenue after accepting
campaign. Since we’re concerned with predicting whether a customer will
accept the offer on the last campaign, that information won’t be useful.
Additionally, we aren’t concerned with the ID of customers
so we can remove that as well.
# Remove unecessary columns
df$Z_CostContact <- NULL
df$Z_Revenue <- NULL
df$ID <- NULL
length(colnames(df))
## [1] 28
Another aspect of tidying data is checking the type of each
predictor. In our data, we have a lot of categorical variables like
Education, Marital_Status, etc. This
distinction between categorical and numeric variables is important
because that will impact our model. To ensure our model can take in the
data properly, we must factorize our categorical variables.
# Change to factor type
df$Education <- as.factor(df$Education)
df$Marital_Status <- as.factor(df$Marital_Status)
df$Complain <- as.factor(df$Complain)
df$Generation <- factor(df$Generation, levels=c("Lost Generation", "Greatest Generation", "Silent Generation", "Baby Boomer Generation", "Generation X", "Millenials"))
df$AcceptedCmp1 <- as.factor(df$AcceptedCmp1)
df$AcceptedCmp2 <- as.factor(df$AcceptedCmp2)
df$AcceptedCmp3 <- as.factor(df$AcceptedCmp3)
df$AcceptedCmp4 <- as.factor(df$AcceptedCmp4)
df$AcceptedCmp5 <- as.factor(df$AcceptedCmp5)
df$AcceptedLastCmp <- as.factor(df$AcceptedLastCmp)
head(df)
The variables we’ll be working with are:
Year_Birth: customer’s year of birthEducation: customer’s level of educationMarital Status: customer’s marital statusIncome: customer’s yearly household incomeKidhome: number of small children in customer’s
householdTeenhome: number of teenagers in customer’s
householdRecency: number of days since the last purchaseComplain: 1 if customer complained in the last 2
yearsMntWines: amount spent on wine products in the last 2
yearsMntFruits: amount spent on fruits products in the last
2 yearsMntMeatProducts: amount spent on meat products in the
last 2 yearsNumWebPurchases: number of purchases made through
company’s web siteNumCatalogPurchases: number of purchases made using
catalogueNumStorePurchases: number of purchases made directly in
storesNumWebVisitsMonth: number of visits to company’s web
site in the last monthAcceptedPrv: previously accepted a marketing
campaignAcceptedCmp1: 1 if customer accepted the offer in the
1st campaign, 0 otherwiseAcceptedCmp2: 1 if customer accepted the offer in the
2nd campaign, 0 otherwiseAcceptedCmp3: 1 if customer accepted the offer in the
3rd campaign, 0 otherwiseAcceptedCmp4: 1 if customer accepted the offer in the
4th campaign, 0 otherwiseAcceptedCmp5: 1 if customer accepted the offer in the
5th campaign, 0 otherwiseAcceptedLastCmp(response): 1 if customer accepted the
offer in the latest campaign, 0 otherwiseLet’s also save our tidied data set locally just in case we want to call it later.
clean_df <- df
write.csv(clean_df, file="data/processed/cleaned_marketing_campaign.csv")
Now we can start exploring the data in more depth. This step is crucial because again, it allows us to better understand the context of our data. Through data visualizations, we can get a high level sense of certain patterns in our data.
First, lets explore our variables and check for missing data using the following plot.
vis_miss(df)
The only missing data occurs in our Income predictor and
it isn’t a significant amount so it shouldn’t impact our predictive
model. Since there are only 24 individuals with missing data on
Income, we can drop those values. As a sanity check to make
sure all our missing data is indeed gone, we can run the code below.
df <- na.omit(df)
sum(is.na(df))
## [1] 0
Now, time for the fun part! Lets take a look at some visualizations of our data to gain a better understanding of our customers.
Since we’re working with a lot of predictors, it would be useful to
see the distribution of each predictor. From the plots, we can
understand the general pattern in our customers. For example, all the
food products generally follow the same pattern where it exponentially
decreases as the amount gets higher. Another interesting pattern is that
the Recency variable has no clear growth and resembles a
uniform distribution which means we have a well-represented number of
customers from recently buying from Stellar Harvest to not having bought
anything in a while.
# Select numeric columns
df_num <- df %>% select_if(is.numeric)
# Convert data into long form
df_num_long <- gather(df_num)
# Create histogram
ggplot(df_num_long, aes(x=value)) +
geom_histogram(fill="maroon") +
facet_wrap(~key, scales="free")
Stellar Harvest is interested in seeing the age demographic of their customers as that may play a role in whether or not they accept the latest marketing campaign.
year_plot <- ggplot(df, aes(Year_Birth)) +
geom_density(color="darkred", fill="pink") +
geom_vline(aes(xintercept=mean(Year_Birth)), color='blue', linetype='dashed', size=1)
generation_plot <- ggplot(df, aes(Generation)) +
geom_histogram(stat="count", color="darkred", fill="pink") +
theme(axis.text.x = element_text(angle=45, hjust=1))
grid.arrange(year_plot, generation_plot, ncol=2)
Most of our customers are Baby Boomer’s and Generation X with very few in the Lost and Silent Generation. Our average customer is born in 1968 which falls into Generation X. This is useful information as Stellar Harvest could aim to market content that appeals to their main generational customers or work on increasing engagement in other generations.
gen_prop_last <-ggplot(df, aes(fill=AcceptedLastCmp, x=Generation)) +
geom_bar(position=position_dodge()) +
theme(axis.text.x = element_text(angle=45, hjust=1)) +
labs(y='Count', x='Generation', title='Number of Customers From Each Generation Accepting the Latest Campaign') +
scale_fill_manual(values=c('pink','maroon'))
gen_prop_last
While none of the Lost Generation accepted the offer in the latest marketing campaign, we also need to acknowledge that there are only three customers from the Lost Generation. Pivoting our attention to the more popular generations like Baby Boomer and Generation X, whom we have more data for, we can see about the same proportion of customers would accept the latest offer. An interesting note is that Millenials seem to be more keen to accept the offer.
I would like to hypothesize that Income may play a
significant role in a customers purchasing habits so lets take a look at
how the income of our customer varies.
df_long_income <- df %>%
dplyr::select(Income) %>%
gather(key = "Variable", value = "Value")
ggplot(df_long_income, aes(x=Variable, y=Value)) +
geom_boxplot() +
theme(axis.text.x = element_text(angle=45, hjust=1)) +
labs(title="Distribution of Income", y="Income", x="Customers") +
scale_y_continuous(labels=dollar)
Most of our customers make 40,000 to 70,000 dollars. But we do have one really rich customer. Just for fun, lets take a look at their characteristics. We can see that the customer is well-educated, has only one kid, and hasn’t accepted a single offer from any of our marketing campaigns…makes sense. Since most of our customers do not make $600,000+, this wealthy customer may throw off our model so we’ll remove them.
# Find rich person
df[which.max(df$Income), ]
outliers <- boxplot(df$Income, plot = FALSE)$out
# Filter out rich person
df <- df %>%
filter(Income < max(outliers)-1)
We don’t know the exact offer that Stellar Harvest is pushing for each campaign but it would be interesting to see how previous campaigns would impact the latest one.
cmp1_prop <-ggplot(df, aes(fill=AcceptedLastCmp, x=AcceptedCmp1)) +
geom_bar(position='fill') +
labs(y='proportion')
cmp2_prop <- ggplot(df, aes(fill=AcceptedLastCmp, x=AcceptedCmp2)) +
geom_bar(position='fill') +
labs(y='proportion')
cmp3_prop <- ggplot(df, aes(fill=AcceptedLastCmp, x=AcceptedCmp3)) +
geom_bar(position='fill') +
labs(y='proportion')
cmp4_prop <- ggplot(df, aes(fill=AcceptedLastCmp, x=AcceptedCmp4)) +
geom_bar(position='fill') +
labs(y='proportion')
cmp5_prop <- ggplot(df, aes(fill=AcceptedLastCmp, x=AcceptedCmp5)) +
geom_bar(position='fill') +
labs(y='proportion')
cmp1_prop + cmp2_prop + cmp3_prop + cmp4_prop + cmp5_prop + plot_layout(ncol = 2)
An obvious pattern is that those who have accepted an offer in a previous marketing campaign are more likely to accept the latest marketing campaign. Between those who accepted a campaign from 1-5, it isn’t extremely clear which of the 5 would result in accepting the last campaign. Although those who accepted campaign 2 seems to have the highest proportion of also accepting the last campaign. It would be interesting to see the marketing tactics used in campaign 2.
#list of products
products <- c('MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds')
# Sum amounts spend on products
products_df <- df %>%
select(products) %>% summarize_each(sum) %>%
t() %>% as.data.frame() %>%
rownames_to_column('Products')
# Clean up names
colnames(products_df) <- c('Products', 'Sums')
products <- gsub('Products', '', gsub(c('Mnt'), '', products))
#creating pie chart
ggplot(products_df, aes(x = '', y = Sums, fill = Products)) +
geom_bar(stat = 'identity', width = 1, color = 'black') +
geom_text(aes(label = paste('$', Sums)), color = 'white', position = position_stack(vjust = 0.5)) +
coord_polar('y', start = 0) +
labs(title = 'Percentage of Total Sales from Products', fill = 'Products',
caption = paste('Total Revenue: $', sum(products_df$Sums)))
In the pie chart we can see that the product that makes up the most sales at Stellar Harvest is wine. This is interesting as wine is expensive but not necessarily an item as frequently bought as fruit. I suppose the tradeoff between price and frequency play a role in Stellar Harvest products.
Now, lets look at the relationship between all the numeric variables in our data.
df %>%
dplyr::select(where(is.numeric)) %>%
cor() %>%
corrplot()
The triangle in the middle of the correlation plot formed from the
amount of food products bought by customers all are positively
correlated. This is an interesting pattern as it suggests when a
customer buys more of one product, they are likely to buy more of
others. Our previous hypothesis on Income also seems to be
supported in the plot where Income is positively correlated
with the amount of food product purchases in all categories, web
purchases, catalogue purchases, and store purchases. However,
Income is negatively correlated with the number of web
visits per month. There seems to be a negative correlation between the
amount of times a customer visits the website compared to the amount of
food products they purchase in all the food categories.
Before we train our model, we need to split it into two parts: training and testing. The training set will be used to train our model and testing will be used to test the accuracy of our model. We can proceed by splitting the data we have into two proportions, usually leaning heavier on the training data. For our data, we will allocate 70% of the data to be training.
What if we end up with a training set with all customers who did not accept the last offer and a testing set with customers who did accept the last offer? To address this, we will be using stratified sampling. This ensures that the proportion of customers making up the training and testing set individually will not be heavily biased.
# Set seed
set.seed(712)
# Split data
df_split <- initial_split(df, prop=0.7, strata=AcceptedLastCmp)
df_train <- training(df_split)
df_test <- testing(df_split)
To ensure our data was split properly, lets check the dimensions.
dim(df_train)
## [1] 1550 28
dim(df_test)
## [1] 665 28
df_train %>%
group_by(AcceptedLastCmp) %>%
summarise(prop=n()/(dim(df_train)[1]))
df_test %>%
group_by(AcceptedLastCmp) %>%
summarise(prop=n()/(dim(df_test)[1]))
Given that only 15% accept the offer in the last campaign, it will be difficult for the model to learn what customers accept campaigns given the limited data. However, we can avoid this through upsampling which we’ll get to in the next step.
Recipes allow us to create a standardized set of steps to preprocess our data. This involves addressing missing data, normalizing our data, dealing with imbalances in data, etc. Since we removed the missing data earlier, we will not need to call any imputations in our recipe. Normalizing data is an important preprocessing step because it allows the data from different predictors to have similar ranges which can help reduce bias. Another step we’ll be taking is removing features with zero variance because zero variance means that the feature has the same values for all samples (this is not a crucial step as it’s clear from our EDA that all our features do not have zero variance).
As mentioned above, we have an imbalanced proportion in our response variable so we will upsampling. Upsampling will add more copies of existing data and allow us to achieve a more balanced dataset. Something we will have to keep in mind is that this may result to overfitting, but lets see how our model performs later on.
For our model, we will be using 23 predictors.
df_recipe <- recipe(AcceptedLastCmp ~ Year_Birth + Education + Generation + TotalDays + Marital_Status + Income + Recency + MntWines + MntFruits + MntMeatProducts + MntFishProducts + MntSweetProducts + MntGoldProds + NumDealsPurchases + NumWebPurchases + NumCatalogPurchases + NumStorePurchases + NumWebVisitsMonth + AcceptedCmp3 + AcceptedCmp4 + AcceptedCmp5 + AcceptedCmp1 + AcceptedCmp2 + Complain, data = df_train) %>%
# dummy coding nominal predictors
step_dummy(all_nominal_predictors()) %>%
# remove zero variance
step_zv(all_predictors()) %>%
# standardizing predictors
step_scale(all_predictors()) %>%
step_center(all_predictors()) %>%
# balancing data
step_upsample(AcceptedLastCmp, over_ratio=1, skip=TRUE)
prep(df_recipe) %>% bake(new_data=df_train)
We will be creating ten folds to train our model. This means that ten “versions” of our data will be used to train our model. In each version, we will split it into ten sets where nine of them will act as our training and the remaining one will act as the validation set. Cross validation is extremely helpful because it allows us to make the most of our given data and potentially uncover patterns that we would not be able to see if we just trained on the whole thing once.
Note: Animation uses n=8 where we will use n=10 folds
K-Fold cross validation is advantageous over simply testing on our
training and testing set once because it allows us to identify patterns
in the data that may not have been identifiable otherwise. With 10
folds, we can take the average accuracy across all 10 folds to obtain a
metric that has less variation than simply doing it once. Again we’re
using stratifying here on AcceptedLastCmp to make sure our
data is balanced.
df_folds <- vfold_cv(df_train, v=5, strata=AcceptedLastCmp)
We’ve been talking about the model all this time and now it’s finally time to build the model! To be efficient, the results of each model will be saved as a file so we won’t need to rerun it every time. Since we’re working with a classification problem, we’ll be using the area under the ROC curve (ROC AUC) as our metric. ROC AUC takes into account sensitivity (true positive rate) and specificity (false positive rate) across different threshold values.
# Logistic regression
log_reg_mod <- logistic_reg() %>%
set_engine("glm") %>%
set_mode("classification")
log_reg_wflow <- workflow() %>%
add_model(log_reg_mod) %>%
add_recipe(df_recipe)
# LDA
lda_mod <- discrim_linear() %>%
set_engine('MASS') %>%
set_mode("classification")
lda_wflow <- workflow() %>%
add_model(lda_mod) %>%
add_recipe(df_recipe)
# KNN
knn_mod <- nearest_neighbor(neighbors=tune()) %>%
set_engine("kknn") %>%
set_mode("classification")
knn_wflow <- workflow() %>%
add_model(knn_mod) %>%
add_recipe(df_recipe)
# Elastic net
en_mod <- logistic_reg(mixture=tune(), penalty=tune()) %>%
set_engine("glmnet") %>%
set_mode("classification")
en_wflow <- workflow() %>%
add_model(en_mod) %>%
add_recipe(df_recipe)
# Decision tree
tree_mod <- decision_tree(cost_complexity=tune()) %>%
set_engine('rpart') %>%
set_mode('classification')
tree_wflow <- workflow() %>%
add_model(tree_mod) %>%
add_recipe(df_recipe)
# Random forest
rf_mod <- rand_forest(mtry = tune(),
trees = tune(),
min_n = tune()) %>%
set_engine("ranger", importance = "impurity") %>%
set_mode("classification")
rf_wflow <- workflow() %>%
add_model(rf_mod) %>%
add_recipe(df_recipe)
knn_grid <- grid_regular(neighbors(range=c(1,10)), levels=10)
en_grid <- grid_regular(penalty(range=c(-3, 7)), mixture(range=c(0,1)), levels=10)
tree_grid <- grid_regular(cost_complexity(range = c(-2, 8)), levels = 10)
rf_grid <- grid_regular(mtry(range=c(1,15)), trees(range=c(50,200)), min_n(range=c(1,8)), levels = 8)
# Logistic regression
log_reg_tune <- tune_grid(
log_reg_wflow,
resamples=df_folds
)
# LDA
lda_tune <- tune_grid(
lda_wflow,
resamples=df_folds
)
# KNN
knn_tune <- tune_grid(
knn_wflow,
resamples=df_folds,
grid=knn_grid
)
# Elastic net
en_tune <- tune_grid(
en_wflow,
resamples=df_folds,
grid=en_grid
)
# Decision tree
tree_tune <- tune_grid(
tree_wflow,
resamples=df_folds,
grid=tree_grid
)
# Random forest
rf_tune <- tune_grid(
rf_wflow,
resamples=df_folds,
grid=rf_grid
)
# Logistic regression
## No tuning
# LDA
## No tuning
# KNN
write_rds(knn_tune, file="models/knn.rds")
# Elastic net
write_rds(en_tune, file="models/elastic.rds")
# Decision tree
write_rds(tree_tune, file="models/tree.rds")
# Random forest
write_rds(rf_tune, file="models/forest.rds")
knn_tuned <- read_rds(file = "models/knn.rds")
en_tuned <- read_rds(file="models/elastic.rds")
tree_tuned <- read_rds(file="models/tree.rds")
rf_tuned <- read_rds(file="models/forest.rds")
# Logistic regression
log_reg_fit <- fit(log_reg_wflow, data=df_train)
# Elastic net
best_en <- select_best(en_tuned, metric='roc_auc', penalty, mixture)
en_final <- finalize_workflow(en_wflow, best_en)
en_fit <- fit(en_final, data=df_train)
# KNN
best_knn <- select_best(knn_tuned, metric='roc_auc', neighbors)
knn_final <- finalize_workflow(knn_wflow, best_knn)
knn_fit <- fit(knn_final, data=df_train)
# LDA
best_lda <- select_best(lda_tune)
lda_final <- finalize_workflow(lda_wflow, best_lda)
lda_fit <- fit(lda_final, data=df_train)
# Tree
best_complexity <- select_best(tree_tuned)
tree_final <- finalize_workflow(tree_wflow, best_complexity)
tree_fit <- fit(tree_final, data=df_train)
# Random forest
best_rf <- select_best(rf_tuned)
rf_final <- finalize_workflow(rf_wflow, best_rf)
rf_fit <- fit(rf_final, data=df_train)
# Logistics regression
log_roc_auc <- augment(log_reg_fit, new_data=df_train) %>%
roc_auc(truth=AcceptedLastCmp, estimate=.pred_0)
# KNN
knn_roc_auc <- augment(knn_fit, new_data=df_train) %>%
roc_auc(truth=AcceptedLastCmp, estimate=.pred_0)
# Elastic net
en_roc <- augment(en_fit, new_data=df_train) %>%
roc_auc(truth=AcceptedLastCmp, estimate=.pred_0)
# LDA
lda_roc_auc <- augment(lda_fit, new_data=df_train) %>%
roc_auc(truth=AcceptedLastCmp, estimate=.pred_0)
# Tree
tree_roc_auc <- augment(tree_fit, new_data=df_train) %>%
roc_auc(truth=AcceptedLastCmp, estimate=.pred_0)
# Random forest
rf_roc_auc <- augment(rf_fit, new_data=df_train) %>%
roc_auc(truth=AcceptedLastCmp, estimate=.pred_0)
After all that time generating each model, which one performed the best?
# Create tibble of all best values
compare_models <- tibble(Model=c("Logistic Regression", "Elastic Net", "K Nearest Neighbors", "LDA", "Decision Tree", "Random Forest"), ROC_AUC=c(log_roc_auc$.estimate, en_roc$.estimate, knn_roc_auc$.estimate, lda_roc_auc$.estimate, tree_roc_auc$.estimate, rf_roc_auc$.estimate))
compare_models <- compare_models %>% arrange(-ROC_AUC)
compare_models
Based on the performance of the models on the cross-validation data, it’s clear that our winner is the random forest model! K Nearest Neighbors and Logistic Regression also performed well. I’m surprised that the Decision Tree performed so poorly, even worse than LDA and Elastic Net.
To get a better sense of how our models performed, we can use some built-in R visualizations that will allow us to see how the performance varies under our specified grid.
autoplot(tree_tuned)
Our decision tree begins to decrease as soon as the cost-complexity parameter is above 0. Given the ease of interpretability of decision trees, lets visualize the tree for fun.
tree_fit %>%
extract_fit_engine() %>%
rpart.plot()
It’s interesting to see the factors that our tree favored like
TotalDays, Recency, Income, and
MaritalStatus. While this results in a cool visual, our
decision tree was unfortunately not our best performing model.
autoplot(en_tuned)
In the elastic net, we tuned penalty and mixture. The autoplot displays that the ROC AUC is relatively stable for all proportions of Lasso penalty until the amount of regularization reaches 1e-02. After 1e-02, the proportions with the smallest decrease are 0.0 and 0.11.
autoplot(rf_tuned)
All the node sizes have relatively similar ROC AUC trends. It is difficult to discern which one is the best visually but overall they all seem to have a significant increase when we reach 4 randomly selected predictors.
autoplot(knn_tuned)
The KNN continues to increases in ROC AUC as we increase our number of nearest neighbors. It would be interesting to see what happens if we went beyond 10 neighbors but given the accuracy of our KNN model is already relatively high, sticking with just 10 neighbors will do for simplicity.
Since our random forest model had the best overall performance, let’s take a closer to see which of the random forest models is the best. Our best performer is Random Forest 291! Below is the model’s output and associated parameters.
rf_tuned %>%
collect_metrics() %>%
arrange(desc(mean)) %>%
slice(1)
Drumroll, please. For it’s now time to see our model’s true ROC AUC performance on our testing dataset!
# Fit data
final_rf_model_test <- augment(rf_fit, new_data=df_test) #%>%
#select(AcceptedLastCmp, starts_with(".pred"))
roc_auc(final_rf_model_test,truth = AcceptedLastCmp, estimate = .pred_0)
roc_curve(final_rf_model_test,truth = AcceptedLastCmp, estimate = .pred_0) %>%
autoplot() +
labs(title="ROC Curve of our Random Forest #291")
The ROC AUC is 0.886 and the ROC curve does not look too shabby! Given our ROC AUC is above 0.8, it’s considered a relatively fair score and that means our model did a good job. While we have all this data on a customer’s buying habits, humans are ultimately quite random creatures so a ROC of 0.886 would be helpful in providing data-driven insight for Stellar Harvest on their customers.
Stepping out of Random Forest 291, it would be interesting to see which predictors the Random Forest model values the most to give us some more context. We can see this through the VIP value which measures the importance of each variable. The higher the VIP value, the more important it is in explaining the variation in the response variable.
rf_fit %>% extract_fit_parsnip() %>%
vip() +
theme_minimal() +
labs(title="Top 10 Predictors w/ Highest VIP")
For our Random Forest, it seems to have valued Recency,
TotalDays, and Income the most.
Recency makes sense as it tells us when the last time a
customer purchased something at the store was. And
TotalDays gives us a sense of customer loyalty (how long
they’ve been with the store). Together, these two variables would be
helpful in revealing loyal and active customers. Also, it’s interesting
to see that of all the products, the amount of wine is ranked as the
most important. Wine also happens to be the product that consumes most
of Stellar Harvet’s product income.
Diving back into Random Forest 291, we can further visualize how well it performed through a confusion matrix. Luckily, the ROC AUC value is able to incorporate knowledge on False Positive and True Negatives when assessing our model. Overall, it seems most of our predictors were True Positive and True Negatives.
conf_mat(final_rf_model_test,truth = AcceptedLastCmp, estimate = .pred_class) %>%
autoplot(type='heatmap')
Out of the models that I have tested, the random forest model performed the best, while the decision tree model performed the worst.
This outcome is not entirely surprising since random forests tend to perform well in predicting outcomes They are also very versatile and can handle a variety of data types and structures, making them an ideal choice for classification problems. Decision trees, on the other hand, are simpler models that can suffer from overfitting, especially when the data is noisy or contains many features.
No model is perfect, and it’s expected that some models will perform better than others in different scenarios. Model performance can also vary depending on the quality and quantity of data, feature selection, and model tuning.
In terms of next steps, I’d want to explore whether other models like support vector machines would perform better than the random forest. Given the limited computational power of my laptop, I wasn’t able to go extremely deep into tuning a variety of parameters so next time it would be beneficial to spend more time feature engineering to improve the quality of the data and boost model performance.
Overall, the results suggest that the random forest model is the most suitable for predicting whether a customer will accept the offer in the latest marketing campaign of “Stellar Harvest.”